Evaluating Model Robustness to Image Quality Degradation in Histological Classification

Image 14

Published

June 5, 2027

Code
library(readr)
library(dplyr)
library(ggplot2)
library(purrr)
library(stringr)
library(cowplot)

metrics <- read_csv("../metrics/combined_report_metrics.csv")

1 Executive Summary

Computer vision models are increasingly used in pathology, often achieving performances comparable to expert clinicians. However, these models are typically trained on high-quality images that are costly to produce, while real-world clinical data is often lower in quality. This study evaluates model performance under varying levels of image degradation to identify architectures that are more robust to realistic, lower quality inputs and inform clinical expectations and decisions in practice.

The impact of focus-related artefacts in Whole Slide Imaging (WSI)—specifically blur and noise—on the classification of breast tissue cells was the focus of this experiment. To simulate the low-quality test data often encountered in clinical settings, Gaussian blur and noise were applied to images during testing. Four models were evaluated: ResNet50 (pretrained on ImageNet), a custom CNN, Random Forest, and XGBoost. Deep learning models used normalised inputs to learn local features, while machine learning models applied PCA to capture global patterns.

All models experienced decreased performance with image degradation; however, the machine learning models were the most robust. They showed negligible drops with noise and only ~10% accuracy loss at extreme blur levels. Notably, both consistently detected >80% of tumours and kept relatively high precision for immune cells, showing potential as screening tools even if overall accuracy drops. The deep learning models, while initially the most accurate, deteriorated rapidly with augmentation, often defaulting to one or two classes.

ResNet50 remained optimal on high-quality data (≥70% accuracy), but XGBoost demonstrated the best performance under degraded conditions, with stable accuracy and high recall for tumour cells. These results highlight the need to match model choice to the expected quality of clinical input data.

To support clinical interpretation, a Shiny application was developed to visualise the effects of image degradation and model performance at each augmentation level, both overall and by class. The app also enables pathologists to upload their own images and observe model predictions across varying levels of degradation, providing insights into prediction stability, and the most appropriate model for their image quality.

The experiment was designed for full reproducibility. All code and implementation instructions are available at https://github.com/AlanS812/data3888-14. Figure 1 outlines the experimental workflow, with corresponding Python scripts indicated for clarity and ease of use.

Figure 1: Diagram illustrating the experimental workflow and deployment pipeline. Each step is annotated with the corresponding Python script used, enabling full reproducibility of the process.

2 Background

The medical field is increasingly turning to computer vision models for classification tasks, often to distinguish cell types. With rapid progress in the field, models have reached accuracies of up to 98.5% [1], slightly above that of an experienced human pathologist [2]. The question becomes: can these models maintain their performance in day-to-day practice? In real life, cell images do not always have perfect quality. Investigating how drops in image quality affect classification performance will give medical imaging professionals a better view of how these models will perform on real data.

To address this issue, it is first necessary to determine the common causes and kinds of image degradation. These vary greatly depending on the situation; for example, motion blur is a typical challenge for MRIs [3]. This report, however, centres on histological H&E-stained tissue slides, specifically of breast cancer tissue [4]. The major cause of quality issues in this case is Whole Slide Imaging (WSI), which introduces blur and noise [5].

WSI is the common practice of scanning full microscope slides, rather than scanning section by section [5]. It has reduced the processing time of a single slide to mere minutes, but there are trade-offs in image quality. Focal points, the points where the camera centres its focus, are selected automatically or manually. As microscope slides are three-dimensional, if a focal point is selected on a region whose depth differs from the typical focus depth of the slide, its neighbouring areas will be out of focus. This leads to blur, and to more prominent noise, because the camera fails to fully capture high-frequency information such as texture and edges [6].

There are ways to mitigate image quality drops: increasing the number of focal points reduces the prominence of the issue, but does not eradicate it entirely. It also increases processing time, inconveniencing patients and adding delays at overloaded labs [5]. Imaging technology is advancing quickly, but given its prohibitive expense, the issue is likely to persist. Therefore, giving clinicians information on how degraded images will affect their models, and which architectures are better suited to lower image quality, is crucial to the success of medical image classification.

3 Method

Data was sourced from the Gene Expression Omnibus, a public repository of biomedical data. A Xenium Analyser was used to produce high-resolution whole-slide images of breast tissue with a tumour present [4], an alternative to the lower-quality images typically generated by WSI. Individual cells were identified, cropped to an image, and labelled [7]. Though cell images with pixel borders of both 50 and 100 were provided, only the 100-pixel set was used, due to its higher historical performance and to balance computational constraints given the high volume of raw data provided [8]. All images originated from the same selected slide.

Images were grouped into classes based on their role in breast cancer progression and diagnosis. Tumour cells were grouped for their direct pathological relevance, immune cells for their similar functional responses to disease, and stromal cells for their structural role in the tumour environment. The “other” group included cells that did not clearly fit the main categories but were retained to help the model learn to distinguish diagnostically important cells from potentially less relevant ones [9, 10].

Each class was randomly and equally sampled to ensure the model learns from all biologically important categories, not just those most prevalent in the provided tissue. In real tissue, critical cells like tumour or immune types may be rare, so underrepresenting them could potentially weaken the model’s diagnostic ability. Unlabelled cell images were excluded to ensure consistent truth labels and avoid introducing noise into the training process.

To balance model performance with computational constraints, 20,000 of the given ~175,000 images were used, with 75% for training, 10% for validation, and 15% for testing. Although a typical data split allocates 80% for training, 10% for validation, and 10% for testing, the test set was increased to enhance the reliability of the final evaluation metrics. Three separate test sets were used, with metrics averaged, to assess model stability. This approach provides greater insight into variability across test sets, offering medical professionals a clearer understanding of model reliability in practice.

Due to natural variation in cell size, cropped images were not uniform, and were therefore resized to 224×224 pixels using Lanczos downsampling (Duchon 1979) [11] to meet ResNet50’s input requirements.
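The balanced sampling and 75/10/15 split described in this section can be sketched as follows. This is a minimal illustration, not the project's preprocessing code: the function name `balanced_split`, the `(file name, label)` record format, the toy class counts and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict

def balanced_split(records, per_class, fractions=(0.75, 0.10, 0.15), seed=0):
    """Sample `per_class` items from every class, then split each class
    into train/validation/test so all three subsets stay balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in records:
        by_class[label].append((item, label))

    train, val, test = [], [], []
    for label in sorted(by_class):
        items = by_class[label]
        if len(items) < per_class:
            raise ValueError(f"class {label!r} has only {len(items)} items")
        chosen = rng.sample(items, per_class)      # random, without replacement
        n_train = round(per_class * fractions[0])
        n_val = round(per_class * fractions[1])
        train += chosen[:n_train]
        val += chosen[n_train:n_train + n_val]
        test += chosen[n_train + n_val:]           # remainder goes to test
    return train, val, test

# Toy data: four classes with unequal prevalence (hypothetical file names).
counts = [("immune", 900), ("other", 600), ("stromal", 700), ("tumour", 2000)]
records = [(f"{c}_{i}.png", c) for c, n in counts for i in range(n)]

train, val, test = balanced_split(records, per_class=500)
print(len(train), len(val), len(test))
```

Sampling per class before splitting means the rarest class sets the ceiling on `per_class`, which is one reason a subset rather than the full dataset may be used.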

To assess the impact of reduced image quality on classification performance, two classical machine learning models and two deep learning models were trained, to compare their robustness and determine whether either family had an advantage.

The models were chosen for their compatibility with high-dimensional medical image data. A custom CNN and an ImageNet-pretrained ResNet50 were used for their ability to extract spatial features directly from pixel data, while Random Forest and XGBoost were selected for their robustness to high-dimensional, potentially redundant features, particularly when using HOG or PCA to make the feature space more compact and informative. All four have had particular success in medical image classification in previous studies [12-15]. Models such as SVM and k-NN were avoided due to their difficulty with increased data scale and their sensitivity to the high-dimensional nature of image-based classification tasks.

Images were normalised with ImageNet parameters for both deep learning models to ensure comparability between results. For the machine learning models, three input representations were tested: raw pixels, Histogram of Oriented Gradients (HOG) and Principal Component Analysis (PCA) [16]. Raw pixels were too computationally expensive to scale, and HOG gave poor results, likely due to the loss of intensity and colour information, which is vital in stained images.

PCA gave the best performance while minimising space, converting ~50,000 pixels to 100 components that account for ~60% of the variance in the dataset. Images were flattened, then the principal components (PCs) were fit to the training dataset. Testing images were linearly transformed with the pre-fit PCs, which were not refit to the testing data, to avoid data leakage. Each principal component can be visualised as a 224×224 image capturing some global feature, and transformed images give insight into how the model decides, allowing for better interpretability, which is vital for high-stakes medical decisions. [15]
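The fit-on-train, transform-on-test pattern described above can be sketched with a plain SVD. This is an illustration under stated assumptions, not the contents of pca.py: the helper names `fit_pca` and `transform`, the random stand-in data and the 16×16 image size are hypothetical, chosen only to keep the example small.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Fit PCA on the training matrix only (rows = flattened images)."""
    mean = X_train.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return mean, Vt[:n_components], explained[:n_components]

def transform(X, mean, components):
    """Project images with the pre-fit mean and components (no refitting)."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
# Stand-in data: 200 training and 50 test "images" of 16x16 pixels.
X_train = rng.random((200, 16 * 16))
X_test = rng.random((50, 16 * 16))

mean, components, explained = fit_pca(X_train, n_components=10)
Z_train = transform(X_train, mean, components)
Z_test = transform(X_test, mean, components)  # training statistics only

print(Z_train.shape, Z_test.shape)
print(f"share of variance kept: {explained.sum():.2f}")

# Each component can be reshaped to image size for visual inspection.
pc_image = components[0].reshape(16, 16)
```

Because the test images never influence `mean` or `components`, degraded test images are projected onto directions learned from clean training data only.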

Models were tested using the same testing set at different levels and combinations of Gaussian blur and noise to simulate WSI damage [6]. Initially both were tested at small increments, which were increased once a pattern emerged to manage computation. Blur was tested at kernel sizes from 0 to 19, and noise at levels from 0 to 30; this provided a full range from high-quality to completely degraded images.
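A minimal sketch of applying Gaussian blur and noise at test time is shown below. Several details are assumptions rather than the project's settings: blur is applied before noise, the noise level is treated as a standard deviation on the 0-255 pixel scale, kernel sizes are restricted to odd values, and sigma is derived from the kernel size with the rule OpenCV documents for `getGaussianKernel` when no sigma is given.

```python
import numpy as np

def gaussian_kernel(ksize):
    """1-D Gaussian kernel of odd length `ksize`, normalised to sum to 1."""
    sigma = 0.3 * ((ksize - 1) * 0.5 - 1) + 0.8  # assumed sigma rule
    x = np.arange(ksize) - (ksize - 1) / 2
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, ksize):
    """Separable Gaussian blur on an HxWxC float image; ksize 0 = no blur."""
    if ksize == 0:
        return img.copy()
    if ksize % 2 == 0:
        raise ValueError("ksize must be 0 or odd")
    k = gaussian_kernel(ksize)
    pad = ksize // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    # Convolve along the width, then along the height.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def add_noise(img, std, rng):
    """Add zero-mean Gaussian noise (std in pixel units) and clip to 0-255."""
    if std == 0:
        return img.copy()
    return np.clip(img + rng.normal(0.0, std, img.shape), 0, 255)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3)).astype(float)  # stand-in image

# Evaluate the same image under a small grid of degradation levels.
for ksize in (0, 5, 19):
    for std in (0, 10, 30):
        degraded = add_noise(blur(img, ksize), std, rng)
        print(ksize, std, degraded.shape, round(float(degraded.std()), 1))
```

Looping over the full grid of kernel sizes and noise levels with a fixed test set is what lets each model's metrics be compared level by level.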

To evaluate overall performance, accuracy, confusion matrices, average maximum confidence, and weighted F1, precision and recall scores were recorded. For a per-class breakdown, precision, recall, F1, the mean and standard deviation of confidence, and prediction count were recorded.
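The per-class and weighted metrics listed above can all be derived from a confusion matrix, as the following sketch shows. The matrix values are invented for illustration, and returning 0 for a class that is never predicted is a convention chosen here, not necessarily the one used in the project's evaluation scripts.

```python
def per_class_metrics(cm):
    """Precision, recall, F1 and support per class from a confusion matrix
    whose rows are true classes and columns are predicted classes."""
    n = len(cm)
    out = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column total
        actual = sum(cm[c])                          # row total (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        out.append({"precision": precision, "recall": recall,
                    "f1": f1, "support": actual})
    return out

def weighted_average(metrics, key):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(m["support"] for m in metrics)
    return sum(m[key] * m["support"] for m in metrics) / total

def accuracy(cm):
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# Hypothetical 4-class confusion matrix (Immune, Other, Stromal, Tumour).
cm = [[60, 5, 20, 15],
      [10, 30, 30, 30],
      [10, 5, 70, 15],
      [5, 5, 10, 80]]

m = per_class_metrics(cm)
for key in ("precision", "recall", "f1"):
    print(key, round(weighted_average(m, key), 3))
print("accuracy", accuracy(cm))
```

Support-weighted recall is algebraically identical to accuracy, so the two always move together; weighted precision and F1 carry the additional information.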

4 Results

4.1 Machine vs Deep Learning

Code
blur_plot_data <- metrics %>%
  select(test_set, blur_size, noise_level, accuracy, Model_Label) %>%
  filter(Model_Label %in% c("RF (PCA)", "XGBoost (PCA)", "CNN", "ResNet")) %>%
  group_by(blur_size, noise_level, Model_Label) %>%
  summarise(accuracy = mean(accuracy), .groups = "drop") %>%
  mutate(
    noise_level = as.factor(noise_level),
    Model_Label  = factor(Model_Label)
  )

print(ggplot(blur_plot_data, aes(x = blur_size, y = accuracy, color = noise_level, group = noise_level)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~ Model_Label, ncol = 2) +
  labs(
    x = "Blur Radius",
    y = "Accuracy",
    color = "Noise Level"
  ) +
  scale_color_brewer(palette = "Blues") +
  theme_minimal(base_size = 13) +
  theme(
    strip.text = element_text(face = "bold", size = 12),
    legend.position = "right"
  ))
Figure 2: Accuracy vs Blur Radius by Noise Level, faceted by Model.

As shown in Figure 2, deep learning models perform best on unaugmented images, but their performance drops sharply with any augmentation, with the exception of precision. In contrast, the machine learning models maintain relatively stable accuracy, F1, and weighted recall across augmentation levels, and mostly stable precision.

Precision in the deep learning models is the exception: it stays stable up to high levels of blur, provided no noise is applied. This stability is misleading. When blur is applied, the models predict ‘other’ and ‘tumour’ extremely rarely, which artificially inflates the average precision. Once noise is applied, only immune and stromal are predicted.
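A small worked example illustrates how rarely predicted classes can inflate average precision. The confusion matrix below is invented to mimic a model that has collapsed onto two classes; it is not taken from the study's results.

```python
def precision_recall(cm, c):
    """Precision and recall for class index c (rows = true, cols = predicted)."""
    predicted = sum(row[c] for row in cm)
    actual = sum(cm[c])
    precision = cm[c][c] / predicted if predicted else 0.0
    recall = cm[c][c] / actual if actual else 0.0
    return precision, recall

classes = ["Immune", "Other", "Stromal", "Tumour"]

# Hypothetical collapsed model: it almost always answers Immune or Stromal
# and predicts Other/Tumour only in a handful of cases it gets right.
cm = [[70, 0, 30, 0],
      [45, 1, 54, 0],
      [35, 0, 65, 0],
      [48, 0, 50, 2]]

for c, name in enumerate(classes):
    p, r = precision_recall(cm, c)
    print(f"{name:8s} precision={p:.2f} recall={r:.2f}")

# With equal class sizes, support-weighted and macro precision coincide.
avg_precision = sum(precision_recall(cm, c)[0] for c in range(4)) / 4
accuracy = sum(cm[i][i] for i in range(4)) / sum(map(sum, cm))
print(f"average precision={avg_precision:.2f} accuracy={accuracy:.2f}")
```

In this toy case the two near-perfect precisions for ‘other’ and ‘tumour’ pull the average well above accuracy, even though almost every ‘other’ and ‘tumour’ cell is missed. Reading precision alongside recall and prediction counts avoids this trap.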

4.1.1 Deep Learning

Noise causes the sharpest drop in accuracy for the CNN and ResNet50 models, with ResNet50 dropping to 30% once noise reaches a standard deviation of 1. The CNN tolerates noise up to level 3 before dropping sharply. Both models are highly effective at extracting spatial features from data, relating the information of several neighbouring pixels rather than individual values. Given this reliance on close pixel relationships, it makes sense that slight changes to each pixel’s colour, and to that of all surrounding pixels, would have a significant impact on performance. Mayer et al. (2022) [17] similarly found that denoising software improved CNN performance.

Blur also caused a rapid performance drop, especially beyond kernel sizes of 9 pixels, though less severe than that caused by noise. This result is not surprising, as even to the human eye this level of blurring appears significant. It is consistent with the behaviour seen under Gaussian noise, but less extreme: blur does not perturb each pixel with an independent, normally distributed random value, but instead changes it as a function of its neighbouring pixels’ colour channels. As a result, blur has less impact than noise at low levels, with only minor performance decreases.

Several other studies have similar findings, showing that models trained on sharp images struggle to generalise to blurred images. Interestingly, Jang & Tong (2021) [18] trained CNNs on blurred images first and sharp images later, and found that these models consistently performed better than those trained only on sharp images (as was done in this investigation).

4.1.2 Machine Learning

Figure 3: Original and augmented images reconstructed after PCA dimension reduction, processed in pca.py.

By contrast, although they start at lower accuracies, the machine learning models are more robust to drops in quality. Both models show limited drops with a blur kernel size of up to 5 and hardly any impact from noise levels of up to 30.

The robustness of these models is most likely due to the principal component dimension reduction. This makes intuitive sense once visualised: Figure 3 shows that after the principal component transformation has been applied, the images appear blurred and effectively lose their noise. As the principal components (PCs) are global, capturing the whole image, pixel-level noise is largely discarded. By comparison, a CNN-based model examines local patterns, which can be distorted by localised noise. Notably, PCA is often used as a denoising technique [19], which helps explain its robustness to noise in classification tasks.
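The denoising effect of projecting onto a few global components can be demonstrated on synthetic data. The sketch below is an idealised assumption-laden example: the "images" are constructed to vary along exactly `k` global patterns, which real histology images do not, so the size of the effect here should not be read as a prediction for the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 3  # feature dimension and number of components kept

# Synthetic "images" that vary along only k global patterns.
patterns = rng.normal(size=(k, d))

def sample(n):
    return rng.normal(size=(n, k)) @ patterns

X_train = sample(300)   # clean training data
X_test = sample(100)    # clean test data
X_noisy = X_test + rng.normal(scale=0.5, size=X_test.shape)

# Fit PCA on the clean training data only.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:k]

# Project the noisy test data, then map it back to the original space.
scores = (X_noisy - mean) @ components.T
X_recon = scores @ components + mean

err_before = np.mean((X_noisy - X_test) ** 2)
err_after = np.mean((X_recon - X_test) ** 2)
print(f"MSE vs clean: noisy={err_before:.3f}, after PCA={err_after:.3f}")
```

Independent pixel noise spreads its energy evenly over all `d` directions, so keeping only `k` of them discards most of it while preserving the structured signal.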

The machine learning models are slightly less robust to blur. As blur increases, the linear transformation can detect less information. However, even at extreme levels where the image is uninterpretable to the human eye, the models can still extract colour, spatial and intensity information. Again, because PCA is global, it captures larger-scale features and can continue extracting relevant patterns even with limited local information.

Thus, XGBoost with PCA is best for low-quality data; ResNet50 performs best on high-quality inputs.

4.2 Class Breakdown

Code
# parse confusion matrices
parse_cm <- function(cm_str) {
  # pull every integer out of the stored string and rebuild the 4x4 matrix
  nums <- as.numeric(unlist(str_extract_all(cm_str, "\\d+")))
  matrix(nums, nrow = 4, byrow = TRUE)
}

# average across test sets
get_avg_cm <- function(df, model_name, blur_val, noise_val) {
  cms <- df %>%
    filter(Model_Label == model_name, blur_size == blur_val, noise_level == noise_val) %>%
    pull(confusion_matrix) %>%
    map(parse_cm)
  reduce(cms, `+`) / length(cms)  # mean over the matching test sets
}

plot_cm <- function(cm, title) {
  df <- as.data.frame(as.table(cm))
  colnames(df) <- c("True", "Predicted", "Freq")
  df$True      <- factor(as.integer(df$True),
                         levels=1:4,
                         labels=c("Immune","Other","Stromal","Tumour"))
  df$Predicted <- factor(as.integer(df$Predicted),
                         levels=1:4,
                         labels=c("Immune","Other","Stromal","Tumour"))
  ggplot(df, aes(x=Predicted, y=True, fill=Freq)) +
    geom_tile(color="white") +
    geom_text(aes(label=round(Freq, 1)), size = 4) +
    scale_fill_gradient(low="white", high="dodgerblue") +
    coord_fixed() +
    labs(title=title) +
    theme_minimal(base_size = 12) +
    theme(
      axis.title       = element_blank(),
      legend.position  = "none",
      plot.title       = element_text(hjust=0.5,
                                      size=12,
                                      face="bold",
                                      margin=margin(b=5)),
      plot.margin      = margin(t=5, r=5, b=5, l=5),

      # enlarge & rotate x-labels so they don't collide
      axis.text.x      = element_text(
                           size = 10,
                           angle = 45,
                           hjust = 1,
                           vjust = 1,
                           margin = margin(t = 5)
                         ),
      #give y‐labels a bit more breathing room
      axis.text.y      = element_text(
                           size = 10,
                           margin = margin(r = 5)
                         )
    )
}

# Generate plots for all models and both augmentation levels
p1 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 0, 0), "XGBoost – No Augmentation")
p2 <- plot_cm(get_avg_cm(metrics, "XGBoost (PCA)", 19, 30), "XGBoost – Max Augmentation")

p3 <- plot_cm(get_avg_cm(metrics, "ResNet", 0, 0), "ResNet – No Augmentation")
p4 <- plot_cm(get_avg_cm(metrics, "ResNet", 19, 30), "ResNet – Max Augmentation")

combined1 <- plot_grid(
  p1, p3, p2, p4,
  nrow = 2,
  align = "hv",
  axis  = "tblr"
)

# axis labels
labeled1 <- add_sub(combined1, "Predicted Class", vpadding = grid::unit(1, "lines"))
labeled1 <- ggdraw(labeled1) +
  draw_label("True Class", angle = 90, x = 0, y = 0.5, vjust = 1.5)

labeled1
Figure 4: Confusion matrices (true vs predicted class) averaged across test sets, for XGBoost and ResNet with no augmentation and with maximum augmentation.

As Figure 4 shows, following image augmentation ResNet50 predominantly predicted the immune and stromal classes, while the CNN defaulted almost exclusively to stromal. The machine learning models initially achieved high precision for the immune class (~80%), which declined to ~60% under full augmentation, and they tended to predict stromal and tumour. The ‘other’ class was rarely predicted by any model, likely because it is defined by function rather than by consistent visual features.

5 Application Development

The application is designed to complement the real-world workflows of medical imaging professionals and diagnosticians, where visual assessment is central. By pairing augmented images with model performance metrics, it bridges the gap between human judgement and machine classification, supporting interdisciplinary decision-making.

The backend integrates two key pipelines: a pre-computed (but dynamically updatable) metrics pipeline, and pre-trained models that allow new predictions from user-uploaded images.

Page 1 displays example cell images from each class under user-selected blur and noise levels, alongside graphical performance summaries to allow quick model comparison under the augmentations applied.

Page 2 provides per-model class-level performance, including confidence and confusion matrices, supporting evaluation of model reliability and cell-type-specific performance as required.

Page 3 allows users to upload an image and view predictions across all augmentation combinations, helping assess model behaviour under novel, real-world image conditions.

6 Discussion and Limitations

Blur Simulation: Blur was uniformly applied across images, unlike real-world cases where focus artefacts vary spatially. As image degradation in practice is more complex than simple augmentations, our simulation may not fully reflect real-world conditions. Future work could implement spatially localised or random blur for more realistic degradation.

Data Size: Due to computational limits, only a subset of data was used. Scaling to the full dataset may enhance accuracy and robustness, especially for deep learning models.

Sampling Strategy: Although class labels were balanced, cell subtypes were not, potentially introducing bias. More granular sampling could improve representation.

Domain Generalisability: The study focused on breast tissue. Results may not generalise to other tissues or modalities with different degradation patterns. Broader testing is needed to assess robustness.

PCA and Deep Learning Integration: PCA boosted robustness in machine learning models. Incorporating PCA-reconstructed inputs into deep learning may balance CNN accuracy with global feature stability.

Model Development: Augmentations were test-time only. Training with them could improve resilience. Techniques like denoising or deblurring might also help counter degradation effects.

7 Conclusion

Accurate identification of cell types in histological breast tissue is essential for effective, early cancer diagnosis and treatment planning. However, real-world slides often suffer from quality issues, such as the blur and noise introduced by WSI, an important consideration when the goal is to detect critical classes like tumour cells.

This study evaluated four classification models across varying levels of image degradation, simulated with blur and noise. They were tested on a four-class problem, distinguishing breast tissue cells based on their function in cancer. Image quality had a considerable effect on both machine and deep learning model performance. While ResNet50 had the best performance on unaugmented images, the PCA-based machine learning models achieved stable and predictable performance even under severe image quality degradation. This suggests that simpler machine learning models have potential where histological images are of low quality but correct cell type identification remains important.

This exploration underscores the importance of aligning the choice of model with the expected image quality in real-world workflows of pathologists and histology-based diagnosticians. To support this, an interactive Shiny application was developed to visualise how models react to different levels of image quality, so that medical professionals can make an informed choice of model based on their image quality and diagnostic priorities.

8 Student Contribution

Elise wrote the Shiny app script, deployed a reproducibility pipeline, trained and evaluated RF, and researched cell groupings. Emily trained XGBoost, wrote the evaluation file for ResNet, CNN and XGBoost, wrote helper files/functions for Shiny, and made the presentation slides and schematic. Elise and Emily collated previous research and wrote report and speech drafts, then edited and formatted them into QMD. Jason wrote the processing file, trained ResNet, wrote the initial methodology and results drafts and assisted with some observations. Alan trained the CNN, made initial presentation slides and did research on methodology. Barry did research on background and a few other methodology areas. Ye trained initial KNN and CNN models.

References

1.
2.
3. Luo, S. et al. Motion blur detection in radiographs. SPIE 6914, 100440 (2008).
4.
5.
6. Shakhawat, N. et al. Automatic quality evaluation of whole slide images for the practical use of whole slide imaging scanner. ITE Transactions on Media Technology and Applications 8, 252–268 (2020).
7. Ghazanfar, S. Preprocessing code. (2025).
8.
9. Fridman, W. H. et al. The immune contexture in human tumours: Impact on clinical outcome. Nature Reviews Cancer 12, (2012).
10. Kalluri, R. The biology and function of fibroblasts in cancer. Nature Reviews Cancer 16, (2016).
11. Duchon, C. E. Lanczos filtering in one and two dimensions. Journal of Applied Meteorology and Climatology 18, 1016–1022 (1979).
12.
13. Xu, W., Fu, Y.-L. & Zhu, D. ResNet and its application to medical image processing: Research progress and challenges. Comput Methods Programs Biomed (2023) doi:10.1016/j.cmpb.2023.107660.
14.
15. Yadav, S. S. & Jadhav, S. M. Deep convolutional neural network based medical image classification for disease diagnosis. Journal of Big Data 6, (2019).
16. Mudrova, M. & Prochazka, A. Principal component analysis in image processing.
17.
18.
19. Bakir, G. H., Weston, J. & Schölkopf, B. Learning to find pre-images. in Advances in neural information processing systems 16 449–456 (Biologische Kybernetik; Max-Planck-Gesellschaft; MIT Press, 2004).
20. Deng, J. et al. ImageNet: A large-scale hierarchical image database. in 2009 IEEE conference on computer vision and pattern recognition 248–255 (IEEE, 2009).
21.
22. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 785–794 (ACM, 2016). doi:10.1145/2939672.2939785.
23. Breiman, L. Random forests. Machine Learning 45, 5–32 (2001).
24. O’Shea, K. & Nash, R. An introduction to convolutional neural networks. (2015).
25.